Scalable cloud storage architecture

ABSTRACT

A virtual storage module operable to run in a virtual machine monitor may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines. In-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage replicates a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks are mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or a combination of the local persistent storage and the remote storage to service the block-level data requests.

FIELD

The present application generally relates to computer systems and computer storage, and more particularly to virtual storage and storage architecture.

BACKGROUND

Designing a storage system is a challenging task. For instance, in Cloud Computing, a high degree of virtualization increases the demand for storage space, which requires the use of remote storage. However, uncontrolled access to the remote storage from a large number of virtual machines can easily saturate the networking infrastructure and affect all systems using the network.

More particularly, for example, in an IaaS (Infrastructure-as-a-Service) cloud service, the storage needs of VM (Virtual Machine) instances are met through virtual disks (i.e., virtual block devices). However, it is nontrivial to provide virtual disks to VMs in an efficient and scalable way for a couple of reasons. First, a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demands and physically provision them all in the host machine. On the other hand, if the storage spaces for virtual disks are provided through remote storage servers, aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion.

BRIEF SUMMARY

A storage system and method for handling data for virtual machines, for instance, for scalable cloud storage architecture, may be provided. The system, in one aspect, may include a virtual storage module operable to run in a virtual machine monitor. The virtual storage module may include a wait-queue operable to store incoming block-level data requests from one or more virtual machines, and in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be a replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. A cache handling logic may be operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or a combination of the local persistent storage and the remote storage to service the block-level data requests.

A method for handling data storage for virtual machines, in one aspect, may include intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines. The method may also include obtaining, from in-memory metadata, information associated with data of the block-level data request. The in-memory metadata may store information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines. The data stored in local persistent storage may be a replication of a subset of data in one or more virtual disks provided to the virtual machines. The virtual disks may be mapped to remote storage accessible via a network connecting the virtual machines and the remote storage. The method may further include making I/O requests to the local persistent storage or the remote storage or a combination of the local persistent storage and the remote storage to service the block-level data requests.

A computer readable storage medium storing a program of instructions executable by a machine to perform one or more methods described herein also may be provided.

Further features as well as the structure and operation of various embodiments are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure.

FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure.

FIG. 3 illustrates the structure of one cache entry in one embodiment of the present disclosure.

FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure.

FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.

FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure in one embodiment presents a system (referred to in this disclosure as vStore), which utilizes the host's (e.g., the computer server hosting virtual machines) local disk space as a block-level cache for the remote storage (e.g., network attached storage), for example, in order to absorb network traffic from storage accesses. This allows the VMM (Virtual Machine Monitor, a.k.a. hypervisor) to serve VMs' disk input/output (I/O) requests from the host's local disks most of the time, while providing the illusion of much larger storage space for creating new virtual disks. Caching virtual disks at block level poses special challenges in achieving high performance while maintaining virtual disk semantics. First, after a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. That is, the block-level cache should preserve data integrity in the event of host crashes. To that end, cache handling operations in one embodiment of the present disclosure may ensure consistency between on-disk metadata and data to avoid committing incorrect data to the network attached storage (NAS) during recovery from a crash, while minimizing the overhead of updating on-disk metadata. Second, as disk I/O performance is dominated by disk seek times, a virtual disk should be kept as sequential as possible in the limited cache space. Unlike memory-based caching schemes, the performance of an on-disk cache is highly sensitive to data layout. The present disclosure in one embodiment may utilize a cache placement policy that maintains a high degree of data sequentiality in the cache, as in the original (i.e., remote) virtual disk. Third, the destaging operation that sends dirty pages back to the remote storage server may be self-adaptive and minimize the impact on the foreground traffic.

In another aspect, a scalable architecture is presented that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VMs) in a cloud environment.

FIG. 1 shows the architecture of a scalable Cloud storage system in one embodiment of the present disclosure. The architecture may include one or more VM-hosting machines (e.g., 102, 104, 106). A VM-hosting machine is a physical machine that hosts a large number of VMs and has limited local storage space. vStore 108 uses local storage 110 as a block-level cache and provides to VMs 112 the illusion of unlimited storage space. vStore 108 may be implemented in hypervisor 114 and provides a persistent cache. vStore 108 performs caching at the block device level rather than the file system level. The hypervisor 114 executes on one or more computer processors and provides a virtual block device to VMs 112, which means that VMs 112 see raw block devices and are free to install any file system on top of them. Thus, hypervisor 114 receives block-level requests and redirects them to the remote storage (e.g., 116, 118).

In one embodiment, a single cache space is provided per machine (e.g., 102). The cache tries to replicate the block layout of remote storage (e.g., 116, 118) in the local cache space (local disk) 110.

Storage server clusters (e.g., 116, 118) provide network attached storage to physical machines (e.g., 102, 104, 106). They (e.g., 116, 118) can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices. The interface to the hypervisors 114 can be either block-level or file-level. If it is block-level, an iSCSI type of protocol can be used between storage servers and clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Regardless of the protocol between hypervisors and storage servers, the interface between VMs and the hypervisor remains at block level.

The directory server 120 holds the location information about the storage server clusters. When a hypervisor 114 wants to attach a virtual disk to a VM, it consults the directory server 120 to determine the address of a specific storage server (e.g., 116, 118) that currently stores the virtual disk.

The architecture also includes networking infrastructure. Network bandwidth within a rack is usually well-provisioned, but the cross-rack network is typically under-provisioned by a factor of 5-10 compared to the within-rack network. As a result, uncontrolled storage accesses from VMs can easily deplete the network bandwidth and cause congestion.

An example configuration may have rack-mounted servers for hosting virtual machines and remote storage servers to provide storage services to the VMs. A rack may contain more than 20 servers, with a virtual machine monitor such as the Xen-3.1.4 hypervisor installed on each of them. Servers may have processors such as two Intel® Xeon™ CPUs of 3.40 GHz and memory of, e.g., 2 gigabytes (GB). They can communicate through a 1 Gbps link within the rack. Local storage for each server may be about 1 terabyte, and the servers have a network file system (NFS)-mounted shared storage space that is used to hold VM images for all virtual machines. Remote storage servers may have physical hard disks attached, e.g., through a Serial Advanced Technology Attachment (SATA) interface.

There may be multiple options when designing a storage system for a Cloud. One solution is to use only local storage. In a Cloud, VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space is over-provisioned for the largest possible demand, the cost would be prohibitive. Another solution is to use only network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur a large amount of network traffic and disk I/O load on the storage servers.

Sequential disk access can achieve a data rate of 100 MB/s. Even with pure random access, it can reach 10 MB/s. Since a 1 Gbps network can sustain roughly about 13 MB/s, four uplinks to the rack-level switch are not enough to handle even one single sequential access. Note that uplinks to the rack-level network switches are limited in number and cannot be easily increased in commodity systems. Even for random disk access, they can only support about five VMs' disk I/O traffic. Even with 10 Gbps networks, it is still hard to support thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack and 32 VMs per host, i.e., 1,344 VMs per rack).

vStore 108 takes a hybrid approach that leverages both local storage 110 and network attached storage 116, 118. It still relies on network attached storage 116, 118 to provide sufficient storage space for VMs 112, but utilizes the local storage 110 of a host 102 to cache data and avoid accessing network attached storage 116, 118 as much as possible.

Consider the case of Amazon EC2, where a VM is given one 10 GB virtual disk to store its root file system and another 160 GB virtual disk to store data. The root disk can be stored on local storage due to its small size. The large data disk can be stored on network attached storage and accessed through the vStore cache. Data integrity and performance are two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways. If the host crashes while vStore is in the middle of updating either the metadata or the data, and there is no mechanism for detecting the inconsistency between the metadata and the data, then after the host restarts, incorrect data may remain in the cache and be written back to the network attached storage. Another case that may compromise data integrity is through violating the semantics of writes. If data is buffered in memory and not flushed to disk after reporting write completion to the VM, a system crash will cause data loss. Taking such semantics into consideration, vStore of the present disclosure in one embodiment may be designed to support data integrity.

The second challenge is to achieve high performance, which conflicts with ensuring data integrity; hence vStore may be designed to minimize performance penalties. The performance of vStore may be affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, and (iii) complications introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to a long disk seek time. Therefore, in one embodiment, vStore keeps a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests. For example, if one request is about to evict one cache entry, then all the requests on that entry must wait. All of these factors may be considered in the design of vStore.

FIG. 2 shows the architecture of vStore in one embodiment of the present disclosure. The description herein is based on para-virtualized Xen as an example. VMs 202 generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within the VM 202 after passing through the guest kernel. Then they are forwarded to the back-end driver in Domain-0. The back-end driver issues actual I/O requests to the device, and sends responses to the guest VM 202 along the reverse path.

In one embodiment, the vStore module 204 runs in Domain-0 and extends the function of the back-end device driver. vStore 204 intercepts requests and filters them through its cache handling logic. In FIG. 2, vStore 204 internally may include a wait queue 206 for incoming requests, a cache handling logic 208, and in-memory metadata 210. Incoming requests are first put into vStore's wait queue 206. The wait queue 206 is used in one embodiment because the cache entry that a request needs to use might be under eviction or update triggered by previous requests. After clearing such conflicts, the request is handled by the cache handling logic 208. The in-memory metadata 210 are consulted to obtain information such as block address, dirty bit, and modification time. Depending on the current cache state, actual I/O requests are made to either the cache on local storage 212 or the network attached storage 214.

I/O Unit: Guest VMs usually operate on 4 KB blocks, but vStore can perform I/O to and from the network attached storage at a configurable larger unit. A large I/O unit reduces the size of the in-memory metadata, as it reduces the number of cache entries to manage. Moreover, a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256 KB or even 1 MB). Thus, reading a large unit is as efficient as reading 4 KB. This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost. We use the term block group to refer to the I/O unit used by vStore, as opposed to the (typically 4 KB) block used by the guest VMs. That is, one block group contains one or more 4 KB blocks.
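To make the metadata-size argument concrete, the following small C program compares the number of cache entries (and hence the in-memory metadata footprint) for a 4 KB unit versus a 256 KB block group. It is only an illustration: the 100 GB cache size is an assumed example value, and the roughly 23 bytes per entry follows Table 1 below.

    #include <stdio.h>

    /* Sketch: how the I/O unit size affects the number of cache entries and
     * the in-memory metadata size. The 100 GB cache and the ~23-byte
     * per-entry figure (cf. Table 1) are illustrative assumptions. */
    int main(void)
    {
        const unsigned long long cache_bytes  = 100ULL << 30;  /* assumed 100 GB local cache   */
        const unsigned long long entry_bytes  = 23;            /* per-entry metadata, Table 1  */
        const unsigned long long unit_bytes[] = { 4ULL << 10, 256ULL << 10 };

        for (int i = 0; i < 2; i++) {
            unsigned long long entries  = cache_bytes / unit_bytes[i];
            unsigned long long metadata = entries * entry_bytes;
            printf("I/O unit %3llu KB -> %9llu entries, ~%llu MB of in-memory metadata\n",
                   unit_bytes[i] >> 10, entries, metadata >> 20);
        }
        return 0;
    }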

Metadata: Metadata holds information about cache entries on disk. Metadata are stored on disk for data integrity and cached in memory for performance. Metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that have not been flushed to network attached storage. Table 1 summarizes examples of the metadata fields in one embodiment of the present disclosure.

TABLE 1. vStore Metadata

Field            Size       Description
Virtual Disk ID  2 Bytes    ID assigned by vStore to uniquely identify a virtual disk. An ID is unique only within individual hypervisors.
Sector Address   4 Bytes    Cache entry's remote address in units of sectors.
Dirty Bit        1 Bit      Set if cache content is modified.
Valid Bit        1 Bit      Set if cache entry is being used and the corresponding data is in the cache.
Lock Bit         1 Bit      Set if under modification by a request.
Read Count       2 Bytes    How many read accesses within a time unit.
Write Count      2 Bytes    How many write accesses within a time unit.
Bit Vector       Variable   Each bit represents 4 KB within the block group. Set if the corresponding 4 KB is valid. The size is (block group)/4 KB bits.
Access Time      8 Bytes    Most recently accessed time.
Total Size       <23 Bytes
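For illustration, the per-entry metadata of Table 1 could be rendered as a C structure along the following lines; the field names, the bit-field packing, and the 256 KB block-group size are assumptions for the sketch, not taken from an actual implementation.

    #include <stdint.h>

    #define BLOCK_GROUP_SIZE (256 * 1024)               /* assumed block-group size */
    #define BITVEC_BITS      (BLOCK_GROUP_SIZE / 4096)  /* one bit per 4 KB block   */

    /* One cache-entry metadata record, mirroring the fields of Table 1. */
    struct vstore_meta {
        uint16_t virtual_disk_id;              /* unique only within one hypervisor        */
        uint32_t sector_address;               /* cache entry's remote address, in sectors */
        uint8_t  dirty : 1;                    /* set if cache content is modified         */
        uint8_t  valid : 1;                    /* set if entry is in use and data cached   */
        uint8_t  lock  : 1;                    /* set if under modification by a request   */
        uint16_t read_count;                   /* read accesses within a time unit         */
        uint16_t write_count;                  /* write accesses within a time unit        */
        uint8_t  bit_vector[BITVEC_BITS / 8];  /* which 4 KB blocks of the group are valid */
        uint64_t access_time;                  /* most recently accessed time              */
    };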

Virtual Disk identifier (ID) identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk are identified and reused. Bit Vector has one bit for each 4 KB block in a block group so that the states of 4 KB blocks in the same block group can be changed and tracked individually. Without Bit Vector, the states of 4 KB blocks in the same block group must always be changed together. As a result, when the VM writes to a 4 KB block, vStore must read the entire block group (including all 4 KB blocks in that block group) from network attached storage, merge it with the 4 KB of new data, and write the entire block group to cache. With Bit Vector, vStore can write the 4 KB data directly without fetching the entire block group, and then only change the affected 4 KB block's state in Bit Vector. Our experiments show that Bit Vector helps reduce network traffic when using a large cache unit size.
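Using the vstore_meta sketch above, the effect of Bit Vector on a partial write can be pictured as follows: only the bit of the written 4 KB block is set, so the rest of the block group need not be fetched from the network attached storage first. The helper names are hypothetical.

    /* Mark a single 4 KB block of a block group valid (and the entry dirty)
     * after a partial write, instead of reading and merging the whole block
     * group from remote storage. */
    static void bitvec_set(uint8_t *vec, unsigned idx)
    {
        vec[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }

    void note_partial_write(struct vstore_meta *m, uint64_t offset_in_group)
    {
        unsigned idx = (unsigned)(offset_in_group / 4096);  /* which 4 KB block was written */

        bitvec_set(m->bit_vector, idx);  /* only this block becomes valid in the cache */
        m->dirty = 1;                    /* the group now differs from the remote copy */
    }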

Maintaining metadata on disk may compromise performance. A naive implementation may require two disk accesses to handle one write request issued by a VM: one for the metadata update and one for writing the actual data. In the present disclosure in one embodiment, vStore solves this problem by putting metadata and data together and updating them in a single write. The details are described below.

In-memory Metadata: To avoid disk I/O for reading the on-disk metadata, vStore in one embodiment maintains a complete copy of the metadata in memory and updates it in a write-through manner. One embodiment of the present disclosure uses a large block group size (e.g., 256 KB) to reduce the size of the in-memory metadata.

Cache Structure: vStore in one embodiment of the present disclosure organizes local storage as a set-associative cache with a write-back policy by default. We describe the cache as a table-like structure, where a cache set is a column in the table and a cache row is a row in the table. A cache row includes multiple block groups. A block group has contents coming from one virtual disk, but different block groups in the same cache row may have contents coming from different virtual disks. Block groups in the same cache row are laid out in logically contiguous disk blocks in one embodiment of the present disclosure.

FIG. 3 illustrates the structure of one cache entry in one embodiment of the present disclosure. A block group includes n 4-kilobyte (KB) blocks, and each 4 KB block has a trailer. For instance, each 4 KB block 302 in a block group 304 has a 512-byte trailer 306, as shown in FIG. 3. This trailer 306 in one embodiment includes metadata 308 and the hash value 310 of the 4 KB data block 302. On a write operation, vStore computes the hash of the 4 KB block 302, and writes the 4 KB block 302 and its 512-byte trailer 306 in a single write operation. If the host crashes during the write operation, after recovery, the hash value helps detect that the 4 KB block and the trailer are inconsistent. The 4 KB block can be safely discarded, because the completion of the write operation has not been acknowledged to the VM yet. When handling a read request, vStore also reads the 512-byte trailer 306 together with the 4 KB block 302. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4 KB data block were read without the trailer, the sequential request would be broken into two sub-requests, spaced apart by 512 bytes.
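The on-disk layout of FIG. 3 can be sketched as a 4 KB data block immediately followed by its 512-byte trailer, so that data and metadata reach the disk in one write. The 32-byte hash width and the names below are assumptions; the block and trailer sizes come from the text.

    #include <string.h>

    #define DATA_BLOCK_SIZE 4096
    #define TRAILER_SIZE    512
    #define HASH_SIZE       32   /* assumed width of the stored hash value */

    /* 512-byte trailer: per-entry metadata plus the hash of the 4 KB block. */
    struct vstore_trailer {
        struct vstore_meta meta;
        uint8_t hash[HASH_SIZE];
        uint8_t pad[TRAILER_SIZE - sizeof(struct vstore_meta) - HASH_SIZE];
    };

    /* On-disk unit: data block and trailer written together in one operation. */
    struct vstore_disk_block {
        uint8_t data[DATA_BLOCK_SIZE];
        struct vstore_trailer trailer;
    };

    /* Recovery check: if the stored hash does not match the recomputed hash of
     * the data, the write was interrupted and the block is discarded (its
     * completion was never acknowledged to the VM). */
    int block_is_consistent(const struct vstore_disk_block *b,
                            void (*hash_fn)(const uint8_t *, unsigned, uint8_t out[HASH_SIZE]))
    {
        uint8_t h[HASH_SIZE];
        hash_fn(b->data, DATA_BLOCK_SIZE, h);
        return memcmp(h, b->trailer.hash, HASH_SIZE) == 0;
    }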

Cache Replacement

In one aspect, simple policies like least recently used (LRU) and least frequently used (LFU) may not be suitable for vStore, because they are designed primarily for memory-based caches without consideration of block sequentiality on disk. If two consecutive blocks in a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. In one embodiment, vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks.

Below, we describe an embodiment of vStore's cache replacement algorithm in detail. We introduce the concept of the base cache row of a virtual disk. The base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk are mapped to the subsequent cache rows. For example, if there are two virtual disks Disk₁ and Disk₂ currently attached to the vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk₁ might be assigned 1 as a base cache row and Disk₂ might be assigned 3 to keep them reasonably away from each other. If we assume one cache row is made of ten 128 KB cache groups, Disk₂'s block at address 1280K will be mapped to row 4, which is the next row from Disk₂'s base cache row.
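The default placement described in this example can be expressed as a small helper (a sketch; the names are illustrative). With ten 128 KB block groups per cache row and a base row of 3, Disk₂'s block at address 1280K falls exactly one row past its base row, i.e., row 4.

    #include <stdint.h>

    struct cache_slot { unsigned row; unsigned col; };

    /* Default slot for a block: its block-group index determines the column,
     * and full rows past the base cache row wrap around the associativity. */
    struct cache_slot default_slot(uint64_t block_addr, unsigned base_row,
                                   unsigned associativity,
                                   uint64_t group_size, unsigned groups_per_row)
    {
        uint64_t group_idx = block_addr / group_size;
        struct cache_slot s;
        s.col = (unsigned)(group_idx % groups_per_row);
        s.row = (unsigned)((base_row + group_idx / groups_per_row) % associativity);
        return s;
    }

    /* Example from the text: default_slot(1280 * 1024, 3, 5, 128 * 1024, 10)
     * yields row 4, column 0. */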

Upon arrival of a new data block, vStore in one embodiment determines the cache location in two steps. First, it looks at the state of the cache entry whose location is calculated using the base cache row and the block's address. If that entry is invalid or not dirty, the new block is immediately assigned to that cache entry. If the entry is dirty, a victim entry is selected based on scores. Six criteria, listed below, may be used to calculate the score in one embodiment.

-   Recentness: e.g., the more recently accessed, the higher the score.
-   Prior Sequentiality: This measures how sequential the cache entry is with respect to the adjacent cache entries. If the cache entry is already sequential, then we prefer to keep it in one embodiment.
-   Prior Distance: This measures how far away the cache entry is from the default base cache row. If the entry is located in cache row 2 and the default base cache row of the virtual disk is 1, then the value is 2−1=1.
-   Posterior Sequentiality: This measures how sequential it will be if we cache the new block. If it becomes sequential, then we prefer this cache entry as a victim.
-   Posterior Distance: This measures how far away from the default base cache row it would be if we cache the new block. If this distance is far, it is less preferable.
-   Dirtiness: If the cache entry is modified, we would like to avoid evicting this entry as much as possible.

Let x_i be each of the six criteria described above, for i=0 to 5. A score may be computed using equation (1) as follows.

$S = a_0 \cdot x_0 + a_1 \cdot x_1 + \ldots + a_5 \cdot x_5 \qquad (1)$

Here the coefficient a_i represents the weight of each criterion. If all weights are 0 except the one on recentness, the eviction policy becomes equivalent to LRU. The weight coefficients are adjustable according to preference. In one embodiment, this value (score) is computed for every cache entry within the cache set, and the entry with the lowest score is chosen for eviction.
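Equation (1) amounts to a weighted sum over the six criteria, with the lowest-scoring entry in the cache set evicted; a minimal transcription follows (how the individual criteria values are scaled is left abstract).

    /* Eviction score of equation (1): S = a0*x0 + a1*x1 + ... + a5*x5.
     * x[0..5] hold the six criteria in the order listed above (recentness,
     * prior sequentiality, prior distance, posterior sequentiality,
     * posterior distance, dirtiness); a[0..5] are the tunable weights.
     * The entry with the lowest score in the cache set is evicted. */
    double eviction_score(const double a[6], const double x[6])
    {
        double s = 0.0;
        for (int i = 0; i < 6; i++)
            s += a[i] * x[i];
        return s;
    }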

Cache Handling Operations

In one embodiment of the present disclosure, there may be three cases in cache handling: cache hit, miss without flush, and miss with flush. In one embodiment, the vStore design considers both performance and data integrity in its cache handling operations. Since vStore uses disk as a cache space, cache handling involves more disk accesses than if no cache were used. Excessive disk accesses may degrade the overall performance and reduce the merit of using vStore. In one embodiment of the present disclosure, disk accesses are minimized to make the performance loss tolerable. vStore may address data integrity, in one embodiment, as follows. A 512-byte trailer is added to each 4 KB block to record its hash. In order to minimize disk I/O in one embodiment of the present disclosure, we read and write the trailer together with the block. This only increases the data size, but does not increase the number of I/Os. However, for cache miss handling, additional disk I/O for data integrity may be introduced. In general, such consistency issues complicate the overall cache handling, and there may be a trade-off between maintaining consistency and the performance penalty due to additional disk I/O.
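The three cases can be distinguished from the metadata alone; a minimal sketch using the vstore_meta structure from above (the choice of the entry to compare against is simplified here).

    enum cache_case { CACHE_HIT, CACHE_MISS_NO_FLUSH, CACHE_MISS_WITH_FLUSH };

    /* Classify a request against the cache entry selected for it: a hit when
     * the entry already holds the requested block group, otherwise a miss
     * that requires a flush only when the entry to be replaced is dirty. */
    enum cache_case classify(const struct vstore_meta *e,
                             uint16_t disk_id, uint32_t sector)
    {
        if (e->valid && e->virtual_disk_id == disk_id && e->sector_address == sector)
            return CACHE_HIT;
        return e->dirty ? CACHE_MISS_WITH_FLUSH : CACHE_MISS_NO_FLUSH;
    }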

FIG. 4A is a flow diagram illustrating read request handling in one embodiment of the present disclosure. FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure.

READ Handling

FIG. 4A illustrates a flow diagram for read cache handling in one embodiment of the present disclosure. At 402, a read request is received. The read request may originate from an application in a VM, for example to read data X. At 404, it is determined whether the block group which stores the data of the read request is already cached. For example, the sector address of the read data is compared with the in-memory metadata to determine whether the block group is cached already. If it is determined that the block group is cached, the flow logic proceeds to 406; otherwise the flow logic proceeds to 420.

Using a virtual disk involves multiple steps: open the virtual disk, perform reads/writes, and finally close the virtual disk. When the virtual disk is opened, vStore assigns a "Virtual Disk ID" to the virtual disk and maps it to a remote disk on a storage server (the virtual disk ID was described previously). This mapping relationship is kept in a mapping table, and stored both in memory and on disk in one embodiment. When the VM issues a read request, vStore knows the Virtual Disk ID implicitly (because the request comes from a previously opened handle), and the sector address is specified explicitly. Combining the virtual disk ID and the sector address as one search key to look up the in-memory metadata can determine whether the data is cached and, if so, which block group currently caches the data. The following shows an example data structure of the combined search key.

Virtual Disk ID    2 Bytes
Sector Address     4 Bytes
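Written out as a structure, the 6-byte search key might look as follows (a sketch; how the key indexes the in-memory metadata, e.g., via a hash table, is left out).

    /* Combined search key: virtual disk ID (2 bytes) plus sector address
     * (4 bytes), as laid out above. */
    struct vstore_key {
        uint16_t virtual_disk_id;
        uint32_t sector_address;
    };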

At 406, it is determined whether the 4 KB block corresponding to the requested read data, e.g., data X, is cached. If so, at 408, the local disk is read to retrieve the data. At 410, the data is returned to the requestor. If at 406 it is determined that parts of the requested read data are cached while other parts are not cached (e.g., 1 KB in the cache and 3 KB on the remote storage server), the cached block group is read from the local disk at 412. At 414, data corresponding to the requested read data is read from the remote disk and returned at 416. At 418, the locally read data and the remotely read data are merged. The merged data is written to cache for later reuse on a cache hit.

At 404, if it is determined that the block group corresponding to the requested read data is not cached, the cache replacement algorithm chooses a location in the cache to hold the requested read data. At 420, it is determined whether the old data currently cached at that location is dirty, i.e., whether the old data of that cache entry needs to be stored or updated in the remote storage since that old data will be evicted from the cache. At 420, if the cache entry is not dirty, the requested read data is read from the remote storage device at 422. The data is returned at 424 and written to cache at 426.

At 420, if it is determined that the old data in the cache entry is dirty, at 428, Bit Vector is examined to determine whether the old data in the cache entry is partially valid, i.e., part of the data is stored in the cache while the other part is stored on the remote storage server. Partial validity may be determined, for example, by reading the bit vector values for each of the 4 KB blocks in the block group. For instance, if a bit in the bit vector is set, the corresponding part of the data is valid in the local cache; if it is not set, that part of the data is on remote storage. If it is determined that the existing data in the cache entry is partially valid, the corresponding data from the remote storage device is read at 430. At 432, if the entire data of the cache entry is valid, the data is read from the local storage. At 434, the cache entry data is written to remote storage. If the cache entry has partially valid data, the remotely read data (at 430) is merged with the locally read data (at 432) before the data is written to the remote storage at 434. At 436, the requested read data is read from the remote storage. The read data is returned at 438 to the requestor (e.g., the application that requested it). At 440, the requested read data retrieved from the remote storage is written to cache. Here, the merge at 442 implies a wait for operations on both incoming links (434, 438) to complete before performing the operation on the outgoing link (440). This is used, for example, to guarantee data integrity or to wait for data from both the local disk and the remote storage.

A difference of the read handling in FIG. 4A from the write handling shown in FIG. 4B is that vStore can return the data as soon as it is available and continue the rest of the cache operations in the background. This is reflected in the miss handling operations (e.g., 420 to 440). For example, the remote read (e.g., 422, 436) may be initiated first. As soon as vStore finishes reading the requested block, it returns with the data (e.g., 424, 438). The on-disk metadata update and cache data write may be performed afterwards (e.g., 426, 440).

WRITE Handling

FIG. 4B is a flow diagram illustrating write request handling in one embodiment of the present disclosure. At 450, a write request (or command) is received to write data (e.g., data X). At 452, it is determined whether the block group to which the requested write data belongs is cached, e.g., using the virtual disk ID and sector number as the search key to look up the in-memory metadata. At 454, if the data is cached, the data is written to the local storage, i.e., cached. At 456, the process returns, for instance, acknowledging the successful write to the requestor.

At 458, if the block group is not cached, it is determined whether the block group is dirty, i.e., whether the data content of the block group is modified. Whether the content of the block group is modified may be determined from reading the metadata associated with the block group and the values of the dirty bits of the 4 KB blocks contained therein. At 460, if the content of the block group is determined to be not modified (i.e., not dirty), the requested write data is written to cache. At 462, the process returns, for instance, acknowledging the successful write to the requestor.

If the content of the block group is modified, that data should be written out to the remote storage before the write data can overwrite the existing content of the block group. At 464, if the content of the block group is dirty (modified), it is determined whether the current content of the block group is partially valid. At 466, if the content is only partially valid, the remotely stored data corresponding to that content is read. This data may be merged with the current content of the block group in the local storage in order to make the local block group content wholly valid. At 468, the block group's content is read. At 470, the content of the block group is written to the remote storage. At 472, the requested write data is written to cache at the location of the block group. At 474, the process returns, for instance, acknowledging the successful write to the requestor.

For write requests, vStore in one embodiment directly writes the data to the cache without accessing the network attached storage. This simplifies the operations for a cache hit and a cache miss without flush. But write handling for a cache miss with flush may make several I/O requests. In FIG. 4B, the write handling returns at the end of the entire operation sequence. In the worst case, write handling incurs at most four disk I/Os, which may occur in the case of a cache miss with flush.

Destaging

Destaging refers to the process of flushing dirty (modified) data in the cache to the network attached storage. The destaging functionality in one embodiment of the present disclosure may be used to keep the proportion of dirty blocks under a specified level. A large number of dirty blocks is potentially harmful to performance because evicting a dirty cache entry delays the cache handling operations significantly due to flushing operations. In addition, detachment of a virtual disk can be faster when there are fewer dirty blocks. If a VM wants to terminate or migrate, it has to detach the virtual disk. As part of the detachment process, all the dirty blocks belonging to the detaching storage have to be flushed. Without destaging, the amount of data that has to be transferred can be on the order of several gigabytes. Transferring that amount of data takes time and also generates bursty traffic.

Mechanism Design

In one embodiment of the present disclosure, destaging may be triggered when the number of dirty blocks in the cache exceeds a user-specified level, which we call the pollution level. For example, if the pollution level is set to 65%, it means that the user wants to keep the ratio of dirty blocks to total blocks below 65%.

Upon destaging, vStore in one embodiment may determine how many blocks to destage at a given time t. The basic idea in one embodiment is to maintain a window size w_t which indicates the total allowed data transmission size in units of bytes per millisecond (Bpms). This window size is the combined data transmission size for both normal remote storage accesses and the destaging. It is specified as a rate (Bpms) since the destaging action can be fired irregularly. If w_t increases, it is more likely that normal network attached storage access would leave more bandwidth available for destaging.

The control technique for w_t in vStore may adopt the technique used for flow control in FAST TCP and for queue length adjustment. w_t may be adjusted using the network attached storage latency. Let R be the desired network attached storage latency. Let R_t be the exponentially weighted moving average of the observed network attached storage latency, expressed as R_t = (1−α)R + αR_{t−1}, where α is a smoothing factor. We calculate w_t using

$w_t = (1 - \gamma) w_{t-1} + \gamma \, \frac{R}{R_t} w_{t-1} \qquad (2)$

where γ is another smoothing factor for w_t. If the observed remote latency is smaller than R, then w_t will increase, and vice versa. In vStore, we also may consider the local latency, denoted as v_t.

If we let L_t be the latency of the local disk and L be the desired local disk latency, we calculate v_t as

$v_t = (1 - \gamma) v_{t-1} + \gamma \, \frac{L}{L_t} v_{t-1}$

We take the minimum of w_t and v_t as the window size. Next, we calculate how many block groups to destage using the determined window size. Let d_t denote the number of destage I/Os to perform at time t; then

$d_t = \left( \min(v_t, w_t) \times \tau_t - C_t \right) / B \qquad (3)$

where τ_t is the time length between t and t−1 in milliseconds, B is the block group size, and C_t is the amount of pending I/O requests at time t in bytes. C_t represents the remote access from normal file system operations. Destaging may happen only if d_t > 0.
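Equations (2) and (3) translate directly into a pair of helper functions; a sketch with illustrative names, where the same window update is applied to both the remote window w_t and the local window v_t.

    /* Window update of equation (2) (and its local-disk counterpart):
     * w_t = (1 - gamma) * w_{t-1} + gamma * (desired / observed) * w_{t-1}.
     * The window grows when the observed latency is below the desired one. */
    double update_window(double w_prev, double desired_latency,
                         double observed_latency, double gamma)
    {
        return (1.0 - gamma) * w_prev
             + gamma * (desired_latency / observed_latency) * w_prev;
    }

    /* Equation (3): d_t = (min(v_t, w_t) * tau_t - C_t) / B.
     * tau_ms is the elapsed time in milliseconds, pending_bytes is C_t and
     * block_group_bytes is B; destaging happens only when the result is > 0. */
    long destage_count(double w_t, double v_t, double tau_ms,
                       double pending_bytes, double block_group_bytes)
    {
        double window = (w_t < v_t) ? w_t : v_t;
        double d = (window * tau_ms - pending_bytes) / block_group_bytes;
        return d > 0.0 ? (long)d : 0;
    }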

vStore may be implemented using Xen's blktap interface. Xen is a virtual machine monitor. A virtual machine monitor, also referred to as a hypervisor, allows guest operating systems to execute on the same computer hardware concurrently. Other virtual machine monitors may be used for implementing vStore. FIG. 5 shows, as an example, the Xen implementation of vStore in one embodiment of the present disclosure. The blktap mechanism redirects a VM's disk I/O requests to a tapdisk process 508 running in the userspace of Domain-0. In a para-virtualized VM, a user application 502 reads or writes to the blkfront device 504. Normally blkfront connects to blkback and all the block traffic is delivered to it. If blktap 506 is enabled, blktap replaces blkback and all the block traffic is redirected to the tapdisk process 508. Overall, the blktap mechanism provides a convenient method to intercept block traffic and implement new functionality in user space.

Xen ships with several types of tapdisks, so that the tapdisk process can open the block device using the specified disk type. Disk types are simply a set of callback functions such as open, close, read, write, do callback, and submit. Among the several disk types, the synchronous I/O type uses normal read and write system calls to handle each incoming block I/O. The AIO-based disk type uses the Linux AIO library to issue multiple block requests in a batch. vStore also may implement this predefined set of callback functions and register with tapdisk as another type of tapdisk. vStore 510 may be based on the asynchronous I/O mechanism. For example, vStore submits requests to the Linux AIO library 512 and periodically polls for completed I/Os. Thus, the internal structure of vStore 510 may be an event-driven architecture. A vStore also may be implemented using synchronous I/O in another embodiment.
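The asynchronous submit-and-poll pattern mentioned here can be illustrated with the Linux AIO library. This is a minimal, self-contained sketch of a single read, not the tapdisk integration itself; the file name is a placeholder.

    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal Linux AIO example: submit one 4 KB read and poll for its
     * completion, in the same submit/poll style described for vStore.
     * Build with -laio; "/tmp/example.img" is a placeholder file. */
    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event events[1];
        void *buf;

        int fd = open("/tmp/example.img", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (io_setup(16, &ctx) < 0) { perror("io_setup"); return 1; }
        if (posix_memalign(&buf, 512, 4096)) return 1;

        io_prep_pread(&cb, fd, buf, 4096, 0);   /* read 4 KB at offset 0 */
        if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        /* Poll (blocking here, for simplicity) until the read completes. */
        if (io_getevents(ctx, 1, 1, events, NULL) == 1)
            printf("read completed: %ld bytes\n", (long)events[0].res);

        io_destroy(ctx);
        free(buf);
        return 0;
    }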

In another aspect, the architecture of the present disclosure may also include cloud storage infrastructure with features such as cache block transfer between VM hosts to support fast migration, replication of cache blocks to nearby storage (possibly at a higher level of the hierarchy or in the same rack) within other hosts to support fast restart of VMs on a failed host, and an intelligent workload balancing mechanism between using the local storage and the remote storage for performance and/or cost optimization, e.g., a mechanism to dynamically determine whether to use remote storage or the local cache.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, conventional procedural programming languages such as the "C" programming language or similar programming languages, a scripting language such as Perl, VBS or similar languages, and/or functional languages such as Lisp and ML and logic-oriented languages such as Prolog. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The systems and methodologies of the present disclosure may be carried out or executed in a computer system that includes a processing unit, which houses one or more processors and/or cores, memory and other system components (not shown expressly in the drawing) that implement a computer processing system, or computer that may execute a computer program product. The computer program product may comprise media, for example a hard disk, a compact storage medium such as a compact disc, or other storage devices, which may be read by the processing unit by any techniques known or to be known to the skilled artisan for providing the computer program product to the processing system for execution.

The computer program product may comprise all the respective features enabling the implementation of the methodology described herein, and which, when loaded in a computer system, is able to carry out the methods. Computer program, software program, program, or software, in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.

The computer processing system that carries out the system and method of the present disclosure may also include a display device such as a monitor or display screen for presenting output displays and providing a display through which the user may input data and interact with the processing system, for instance, in cooperation with input devices such as the keyboard and mouse device or pointing device. The computer processing system may be also connected or coupled to one or more peripheral devices such as the printer, scanner, speaker, and any other devices, directly or via remote connections. The computer processing system may be connected or coupled to one or more other processing systems such as a server, other remote computer processing system, network storage devices, via any one or more of a local Ethernet, WAN connection, Internet, etc. or via any other networking methodologies that connect different computing systems and allow them to communicate with one another. The various functionalities and modules of the systems and methods of the present disclosure may be implemented or carried out distributedly on different processing systems or on any single platform, for instance, accessing data stored locally or distributedly on the network.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements, if any, in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied in a computer or machine usable or readable medium, which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure, is also provided.

The system and method of the present disclosure may be implemented and run on a general-purpose computer or special-purpose computer system. The computer system may be any type of known or to-be-known system and may typically include a processor, memory device, a storage device, input/output devices, internal buses, and/or a communications interface for communicating with other computer systems in conjunction with communication hardware and software, etc.

The terms “computer system” and “computer network” as may be used in the present application may include a variety of combinations of fixed and/or portable computer hardware, software, peripherals, and storage devices. The computer system may include a plurality of individual components that are networked or otherwise linked to perform collaboratively, or may include one or more stand-alone components. The hardware and software components of the computer system of the present application may include and may be included within fixed and portable devices such as desktop, laptop, and/or server. A module may be a component of a device, software, program, or system that implements some “functionality”, which can be embodied as software, hardware, firmware, electronic circuitry, etc.

The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

1. A storage system for handling data for virtual machines, comprising: a virtual storage module operable to run in a virtual machine monitor, the virtual storage module including at least, a wait-queue operable to store incoming block-level data requests from one or more virtual machines; in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and a cache handling logic operable to handle the block-level data requests by obtaining the information in the in-memory metadata and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.

2. The system of claim 1, wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.
3. The system of claim 2, wherein the virtual storage module manages block groups and performs I/O requests to the local persistent storage in units of one or more predetermined sized blocks.

4. The system of claim 3, wherein each block stored in the local persistent storage includes a trailer that stores metadata of the block and hash value of the block used for checking data integrity of data content of the block, wherein after a host crash and recovery, the virtual storage module can examine the trailer to determine a virtual disk that owns said each block stored in the local persistent storage, and determine whether the data content of the block and the hash value are consistent.

5. The system of claim 4, wherein the data content of the block and the trailer are read and written together in a single disk I/O operation.

6. The system of claim 3, wherein the virtual storage module organizes the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, and wherein each block group in the same row can store contents coming from a different virtual disk.

7. The system of claim 6, wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.

8. The system of claim 7, wherein the predetermined sized blocks can further store hash value of the data.

9. The system of claim 1, wherein the cache handling logic replaces data in the local persistent storage based on a score determined from summing weighted values associated with how recently the data was accessed, how sequential the data is with respect to an adjacent data, how far away the data is from a base row, how sequential the data would be if new block is cached, how far away from the base row the data would be if a new block is cached, and whether the data is modified.

10. The system of claim 1, wherein the virtual storage module automatically destages modified data in the local persistent storage to the remote storage in response to determining that the modified data has reached a threshold.

11. The system of claim 10, wherein the virtual storage module further determines how many blocks of data to destage at a given time based on total allowed data transmission size including combined data transmission size for both remote storage accesses and destaging.

12. The system of claim 1, wherein the in-memory metadata are persisted on disk in a write-through manner to guarantee data integrity in an event of a host crash.
13. A method for handling data storage for virtual machines, comprising: intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines; obtaining from in-memory metadata, information associated with data of the block-level data request, the in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.
14. The method of claim 13, wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.

15. The method of claim 14, further including managing block groups and performing I/O requests to the local persistent storage in units of predetermined sized blocks.

16. The method of claim 15, further including organizing the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, and wherein each block group in the same row can store contents coming from a different virtual disk.

17. The method of claim 16, wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.

18. The method of claim 17, wherein the predetermined sized blocks can further store hash value of the data.

19. The method of claim 13, further including replacing data in the local persistent storage based on a score determined from summing weighted values associated with how recently the data was accessed, how sequential the data is with respect to an adjacent data, how far away the data is from a base row, how sequential the data would be if new block is cached, how far away from the base row the data would be if a new block is cached, and whether the data is modified.

20. The method of claim 13, further including automatically destaging modified data in the local persistent storage to the remote storage in response to determining that the modified data has reached a threshold.

21. The method of claim 20, further including determining how many blocks of data to destage at a given time based on total allowed data transmission size including combined data transmission size for both remote storage accesses and destaging.
22. A computer readable storage medium storing a program of instructions executable by a machine to perform a method for handling data storage for virtual machines, comprising: intercepting one or more incoming block-level data requests received by a virtual machine monitor from one or more virtual machines; obtaining from in-memory metadata, information associated with data of the block-level data request, the in-memory metadata for storing information associated with data stored in local persistent storage that is local to a host computer hosting the virtual machines, the data stored in local persistent storage being replication of a subset of data in one or more virtual disks provided to the virtual machines, the virtual disks being mapped to remote storage accessible via a network connecting the virtual machines and the remote storage; and making I/O requests to the local persistent storage or the remote storage or combination of the local persistent storage and the remote storage to service the block-level data requests.

23. The computer readable storage medium of claim 22, wherein the in-memory metadata includes at least virtual disk identifier that identifies a virtual disk stored on the remote storage, remote address of the data in the remote storage, a bit vector that indicates whether the data is valid, and a dirty bit that indicates whether the data is modified.

24. The computer readable storage medium of claim 20, further including managing block groups and performing I/O requests to the local persistent storage in units of predetermined sized blocks.

25. The computer readable storage medium of claim 24, further including organizing the local persistent storage as set-associative cache structured into a table-like structure with rows and columns, each of the rows having multiple block groups wherein the block groups in a same row are laid out in logically contiguous disk blocks, wherein each block group in the same row can store contents coming from a different virtual disk, wherein the one or more predetermined sized blocks can store data and metadata associated with the data, and wherein the in-memory metadata includes each of the metadata stored in the one or more predetermined sized blocks.